predict.loess and NA/NaN values

I am fitting loess models to subsets of data in order to use the loess predictions for normalization (similar to what is done in many microarray analyses). While working on this I ran into a problem when I tried to predict from the loess models and the data contained NAs or NaNs. I tracked the problem down to the fact that predict.loess does not return a value at all for such inputs. A toy example:

x <- rnorm(15)
y <- x + rnorm(15)
model.lm <- lm(y~x)
model.loess <- loess(y~x)
predict(model.lm, data.frame(x=c(0.5, Inf, -Inf, NA, NaN)))
predict(model.loess, data.frame(x=c(0.5, Inf, -Inf, NA, NaN)))

The behaviour of predict.lm meets my expectation: I get a vector of length 5 in which the unpredictable entries are NA or NaN. predict.loess, on the other hand, returns only 3 values, quietly skipping the last two. I was unable to find anything in the manual page that explains this behaviour or says how to change it. So I'm asking the community: is there a way to fix this, or do I have to code around it?


--
This is not much help, but I did a bit of digging by using

  debug(stats:::predict.loess)

and then stepping through the function line by line.  Apparently the problem happens before the actual prediction is done.  The code

   as.matrix(model.frame(delete.response(terms(object)), newdata))

has already omitted the NA and NaN.  The problem is that this is the default behavior of model.frame().  Consulting ?model.frame, I see that you can override this by setting the na.action attribute of the data frame passed to it, so I tried setting

  na.dat = data.frame(x=c(0.5, Inf, -Inf, NA, NaN))
  attr(na.dat, "na.action") = na.pass

This does make the as.matrix(model.frame()) line retain the NA and
NaN, but it bombs in the prediction at the subsequent step.  I guess
it really doesn't like NAs as inputs.

What you can do is patch the code to add the NAs back after the
prediction step (which many predict() methods do).
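
One way to do that without touching stats itself is a small wrapper;
this is only a rough, untested sketch (the name predictLoessNA is made
up, and it handles plain point predictions, not se=TRUE):

  predictLoessNA <- function(object, newdata, ...) {
    ## Build the model frame without dropping rows, so we know where the
    ## NA/NaN (and Inf) inputs sit.
    x <- as.matrix(model.frame(delete.response(terms(object)), newdata,
                               na.action = na.pass))
    ok  <- apply(is.finite(x), 1, all)   # rows loess can actually use
    out <- rep(NA_real_, nrow(x))
    ## Predict only on the usable rows, then pad the NAs back in.
    if (any(ok))
      out[ok] <- predict(object, newdata[ok, , drop = FALSE], ...)
    out
  }

  predictLoessNA(model.loess, data.frame(x = c(0.5, Inf, -Inf, NA, NaN)))

This should give a length-5 vector with NA in the positions that
predict.loess would otherwise drop (the Inf rows stay NA as well, since
loess cannot extrapolate to them anyway).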

Cheers,
Andy
      I want to do a nonparametric regression using the function loess.
      The explanatory variable is the year, from 1969 to 2002, and the
      dependent variable is a proportion P, which has missing values (NA).
      The script is:
       
      year <- 1969:2002
      length(year)
      [1] 34
       
      P <- c(NA,0.1,0.56,NA,NA,0.5,0.4,0.75,0.9,
      0.98,0.2,0.56,0.7,0.89,0.3,0.1,0.45,0.46,0.49,0.78,
      0.25,0.79,0.23,0.26,0.46,0.12,0.56,0.8,0.55,0.41,
      0.36,0.9,0.22,0.1)
      length(P)
      [1] 34
       
      lo1 <- loess(P~year,span=0.3,degree=1)
      summary(lo1)
       
      yearCo <- 1969:2002
      year_lo <- data.frame(year = yearCo )
      length(year_lo)

I get 1 here, and so should you: year_lo is a data frame, so length()
counts its columns (1), not its rows (34).

>      mlo <- predict(loess(P~year,span=0.3,degree=1),new.data=year_lo,se=T)

It should be newdata, not new.data

>      mlo$fit
>      mlo$se.fit

Notice that these are of length 31, not 34

You are trying to predict at the values used for fitting (possibly not
what you intended), so you don't actually need this.  Try

lo1 <- loess(P ~ year, span = 0.3, degree = 1, na.action = na.exclude)
fitted(lo1)               # padded with NA for the missing years, so length 34
plot(year, P, type = 'o')
lines(year, fitted(lo1))  # lengths now match because of na.exclude

Or if you want to try interpolation

lines(year, predict(lo1, newdata=year_lo))

This will not extrapolate to 1969, and as far as I recall the version of
loess in R does not allow extrapolation.

>      plot(year,P,type='o')
>      lines(year,predict(loess(P~year,span=0.15,degree=1),new.data=year_lo,
>      se=T,na.action=na.omit)$fit,col='blue',type='l')
>
>      The error message indicates that x and y don't have the same length.

>      In fact, in mlo$fit and mlo$se.fit there are 3 values that don't have a
> fitted value.

Correct, and that's because you used na.action=na.omit and did not specify
newdata.
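
To make the length mismatch concrete, here is a small sketch reusing P and
year from above (lo_omit is just an illustrative name):

lo_omit <- loess(P ~ year, span = 0.3, degree = 1, na.action = na.omit)
## Without newdata you get predictions only at the years actually used in
## the fit, i.e. those where P was not missing:
length(predict(lo_omit, se = TRUE)$fit)    # 31
## With newdata (spelled newdata, not new.data) you get one value per
## requested year; years outside the fitted range come back as NA rather
## than being dropped:
mlo <- predict(lo_omit, newdata = data.frame(year = 1969:2002), se = TRUE)
length(mlo$fit)                            # 34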



--

loess prediction returns NA

I am struggling with "out-of-sample" prediction using loess. I get NA values for new x that are outside the original sample. Can I get these predictions?

x <- c(24,36,48,60,84,120,180)
y <- c(3.94,4.03,4.29,4.30,4.63,4.86,5.02)
lo <- loess(y~x)
x.all <- seq(3,200,3)
predict(object = lo,newdata = x.all)

I need to model the full yield curve.



From the manual page of predict.loess:

When the fit was made using surface = "interpolate" (the default), predict.loess will not extrapolate – so points outside an axis-aligned hypercube enclosing the original data will have missing (NA) predictions and standard errors

If you change the surface parameter to "direct" you can extrapolate values.

For instance, this will work (on a side note: after plotting the prediction, my feeling is that you should increase the span parameter in the loess call a little bit):

lo <- loess(y~x, control=loess.control(surface="direct"))
predict(lo, newdata=x.all)


In addition to nico's answer: I would suggest fitting a gam (which uses penalized regression splines) instead. However, extrapolation is not advisable if you don't have a model based on science.

x <- c(24,36,48,60,84,120,180)
y <- c(3.94,4.03,4.29,4.30,4.63,4.86,5.02)
lo <- loess(y~x, control=loess.control(surface = "direct"))
plot(x.all <- seq(3,200,3),
     predict(object = lo,newdata = x.all),
     type="l", col="blue")
points(x, y)

library(mgcv)
fit <- gam(y ~ s(x, bs = "cr", k = 7, fx = FALSE), data = data.frame(x, y))
summary(fit)

lines(x.all, predict(fit, newdata = data.frame(x = x.all)), col="green")

resulting plot



loess predict with new x values

I am attempting to understand how the predict.loess function is able to compute new predicted values (y_hat) at points x that do not exist in the original data. For example (this is a simple example and I realize loess is obviously not needed for an example of this sort but it illustrates the point):

x <- 1:10
y <- x^2
mdl <- loess(y ~ x)
predict(mdl, 1.5)
[1] 2.25

loess regression works by using polynomials at each x and thus it creates a predicted y_hat at each x. However, because there are no coefficients being stored, the "model" in this case is simply the details of what was used to predict each y_hat, for example, the span or degree. When I do predict(mdl, 1.5), how is predict able to produce a value at this new x? Is it interpolating between the two nearest existing x values and their associated y_hat values? If so, what are the details behind how it does this?



However, because there are no coefficients being stored, the "model" in this case is simply the details of what was used to predict each y_hat

Maybe you used the print(mdl) command, or simply mdl, to see what the model mdl contains, but that brief printout is misleading: the model object is actually quite complicated and stores a large number of parameters.

To get an idea of what's inside, you can use unlist(mdl) and see the long list of parameters it contains.
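
For a quick look at what is actually kept in the fitted object (a small
sketch, reusing mdl from the question):

names(mdl)               # the components stored in the fitted object
str(mdl, max.level = 1)  # their structure, one level deep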

This is the part of the command's manual page that describes how it really works:

Fitting is done locally. That is, for the fit at point x, the fit is made using points in a neighbourhood of x, weighted by their distance from x (with differences in ‘parametric’ variables being ignored when computing the distance). The size of the neighbourhood is controlled by α (set by span or enp.target). For α < 1, the neighbourhood includes proportion α of the points, and these have tricubic weighting (proportional to (1 - (dist/maxdist)^3)^3). For α > 1, all points are used, with the ‘maximum distance’ assumed to be α^(1/p) times the actual maximum distance for p explanatory variables.

For the default family, fitting is by (weighted) least squares. For family="symmetric" a few iterations of an M-estimation procedure with Tukey's biweight are used. Be aware that as the initial value is the least-squares fit, this need not be a very resistant fit.

What I believe is that it tries to fit a polynomial model in the neighborhood of every point (not just a single polynomial for the whole set). The neighborhood does not mean just one point before and one point after, though: if I were implementing such a function, I would put a large weight on the points nearest to x, lower weights on the more distant ones, and then fit the polynomial that best fits under those weights.

Then, if the point x' at which a value should be predicted is closest to the point x, I would take the polynomial fitted on the neighborhood of x, say P, evaluate it at x', and P(x') would be the prediction.
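
To make that idea concrete, here is a rough, illustrative sketch of a
tricube-weighted local (degree-1) fit evaluated at a new point. This is
not the exact algorithm predict.loess uses (with the default
surface = "interpolate" the local fits are precomputed on a kd tree and
then interpolated; surface = "direct" is closer in spirit to this), but it
shows the weighted-neighborhood fitting described above:

## Illustrative only: local linear fit at a query point x0 using tricube
## weights over the span*n nearest observations.
local_fit <- function(x, y, x0, span = 0.75) {
  k   <- ceiling(span * length(x))         # neighborhood size
  d   <- abs(x - x0)                       # distances from the query point
  idx <- order(d)[seq_len(k)]              # the k nearest observations
  w   <- (1 - (d[idx] / max(d[idx]))^3)^3  # tricube weights
  fit <- lm(y[idx] ~ x[idx], weights = w)  # weighted local line
  unname(coef(fit)[1] + coef(fit)[2] * x0) # evaluate that line at x0
}

x <- 1:10
y <- x^2
local_fit(x, y, 1.5)  # compare with predict(loess(y ~ x), 1.5), which uses
                      # degree-2 local fits and the interpolation surface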

Thank you, yes, this is exactly what I describe in the question. Please note: "for the fit at point x, the fit is made using points in a neighbourhood of x". The question is: what happens between x_1 and x_2, at, for example, x_1 + epsilon, which does not exist in the data set?

If every point (say x_1 + epsilon) were in the dataset, what would remain to be predicted? The other point is that we don't have just a single polynomial g(x), but rather n polynomials g_1(x), g_2(x), ..., g_n(x), such that g_i(x) is built to best fit the points in the neighborhood of (x_i, y_i). Simply use the polynomial fitted at the closest point available in the dataset (say x_1) to predict it, so your answer would be g_1(x_1 + epsilon).